Make qwen3's set_embed_and_head idempotent#26798
Conversation
Guard the weight deletions in `Qwen3ForCausalLM.set_embed_and_head` with `hasattr` checks so the method is idempotent: calling it when `embed_tokens.weight` / `lm_head.weight` have already been deleted (e.g. a second invocation, or after a prior tie/share step) no longer raises `AttributeError`.
There was a problem hiding this comment.
Code Review
This pull request modifies the set_embed_and_head method in python/sglang/srt/models/qwen3.py to check for the existence of the weight attribute before deleting it. The reviewer suggested a more robust implementation to handle pipeline parallelism and tied word embeddings, recommending that the code check if the layers are instances of PPMissingLayer and avoid redundant operations when self.lm_head is the same object as self.model.embed_tokens to prevent unnecessary memory usage.
| if hasattr(self.model.embed_tokens, "weight"): | ||
| del self.model.embed_tokens.weight | ||
| if hasattr(self.lm_head, "weight"): | ||
| del self.lm_head.weight | ||
| self.model.embed_tokens.weight = embed | ||
| self.lm_head.weight = head |
There was a problem hiding this comment.
When pipeline parallelism is enabled (pp_size > 1), ranks other than the first rank will have self.model.embed_tokens as a PPMissingLayer, and ranks other than the last rank will have self.lm_head as a PPMissingLayer.
Additionally, when tie_word_embeddings is enabled, self.lm_head is the exact same object as self.model.embed_tokens.
The current implementation:
- Unnecessarily assigns
weighttoPPMissingLayerplaceholders on ranks where they are not used, which can prevent the largeembedorheadtensors from being garbage collected on those ranks. - Performs redundant deletion and setting operations when weights are tied.
We can make this method fully robust and memory-efficient by checking if the modules are instances of PPMissingLayer and ensuring we don't perform redundant operations when self.lm_head is self.model.embed_tokens.
| if hasattr(self.model.embed_tokens, "weight"): | |
| del self.model.embed_tokens.weight | |
| if hasattr(self.lm_head, "weight"): | |
| del self.lm_head.weight | |
| self.model.embed_tokens.weight = embed | |
| self.lm_head.weight = head | |
| if not isinstance(self.model.embed_tokens, PPMissingLayer): | |
| if hasattr(self.model.embed_tokens, "weight"): | |
| del self.model.embed_tokens.weight | |
| self.model.embed_tokens.weight = embed | |
| if not isinstance(self.lm_head, PPMissingLayer) and self.lm_head is not self.model.embed_tokens: | |
| if hasattr(self.lm_head, "weight"): | |
| del self.lm_head.weight | |
| self.lm_head.weight = head |
|
Reviewed for correctness — looks good to me. ✅ The change guards the two
Minor (non-blocking, out of scope): the same Note: both PR Test CI badges are currently red; worth confirming the failures are unrelated to this trivial change before merging. |
Guard the weight deletions in
Qwen3ForCausalLM.set_embed_and_headwithhasattrchecks so the method is idempotent: calling it whenembed_tokens.weight/lm_head.weighthave already been deleted (e.g. asecond invocation, or after a prior tie/share step) no longer raises
AttributeError.CI States
Latest PR Test (Base): ❌ Run #26700294436
Latest PR Test (Extra): ❌ Run #26700294381